Section I

Introduction
Motivation

Mathematical optimization is at the core of Machine Learning; without it, we
wouldn't be able to find the needle in the haystack of the parameter space.

► In Machine Learning, it takes the form of minimizing an
objective function, such as a divergence or any function that
penalizes the mistakes of the model;

► We will talk here about local methods, which are characterized by the
search for an optimal value within a neighborhood in parameter space;

► A huge variety of methods has been developed recently,
so this talk is far from being a comprehensive
collection. I will focus on intuition and understanding instead of
throwing algorithms at you.
Empirical Risk Minimization (ERM)

► In a supervised setting, we want to find a function or model $f_\theta(\cdot)$
that describes the relationship between a random feature vector $x$ and
the label target vector $y$. We assume a joint distribution $p_{\text{data}}(x, y)$;

► We start by defining a loss function $L$, evaluated as $L(f_\theta(x), y)$, that
penalizes the difference between the prediction $f_\theta(x)$
and the true label $y$;

► Now, taking the expectation of the loss, we have our risk $R$:

$$R(f_\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}}\left[ L(f_\theta(x), y) \right]$$

that we want to minimize.
Empirical Risk Minimization (ERM)

► However, we don't know $p_{\text{data}}(x, y)$; we only have access to a sample
training set $D = \{(x_i, y_i)\}_{i=1}^{N} \sim p_{\text{data}}$;

► Therefore, we can approximate the risk with the empirical risk:

$$\hat{R}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} L(f_\theta(x_i), y_i)$$

► The Empirical Risk Minimization (ERM) principle says that our
learning algorithm should minimize the empirical risk;

► Maximum Likelihood Estimation (MLE) can be posed as a
special case of ERM where the loss function is the negative
log-likelihood.
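
To make the empirical risk concrete, here is a minimal sketch in plain Python. The squared loss, the one-parameter linear model and the toy data are assumptions chosen only for illustration; they are not part of the talk.

```python
def squared_loss(prediction, target):
    """Loss L(f(x), y): penalizes the gap between prediction and label."""
    return (prediction - target) ** 2


def empirical_risk(model, loss, xs, ys):
    """Average loss over the training sample, approximating the true risk."""
    return sum(loss(model(x), y) for x, y in zip(xs, ys)) / len(xs)


def make_linear_model(theta):
    """Hypothetical one-parameter model f_theta(x) = theta * x."""
    return lambda x: theta * x


# Hypothetical toy sample, roughly following y = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]

risk_good = empirical_risk(make_linear_model(2.0), squared_loss, xs, ys)
risk_bad = empirical_risk(make_linear_model(0.5), squared_loss, xs, ys)
print(risk_good, risk_bad)
```

The ERM principle then simply prefers the parameter with the lower empirical risk, here `theta = 2.0` over `theta = 0.5`.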
Maximum Likelihood Estimation (MLE)

Under the ERM framework, we can describe the MLE cost function $J(\cdot)$ as:

$$J(\theta) = \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \big[ - \underbrace{\log p_\theta(y \mid x)}_{\text{log-likelihood}} \big]$$

where we define the cost as the expectation under the empirical distribution
$\hat{p}_{\text{data}}$, as we only have access to a sample training set $D = (x_i, y_i) \sim p_{\text{data}}$.

► We might be interested in, say, predicting a statistic of the
distribution, such as the mean of $y$, using the predictor $f_\theta(x)$;

► Our interest here, in terms of optimization, is:

$$\theta^* = \arg\min_{\theta} J(\theta), \quad \text{where } \theta \in \mathbb{R}^n$$
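
As a small sanity check of the "MLE as ERM" view, the sketch below assumes a unit-variance Gaussian model whose only parameter is its mean (an assumption made for illustration, with hypothetical data). Minimizing the average negative log-likelihood over a coarse grid should then land on the sample mean.

```python
import math


def gaussian_nll(y, mean):
    """Negative log-likelihood of y under a unit-variance Gaussian."""
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * (y - mean) ** 2


def mle_cost(theta, ys):
    """J(theta): expectation of the NLL under the empirical distribution."""
    return sum(gaussian_nll(y, theta) for y in ys) / len(ys)


# Hypothetical observations.
ys = [1.0, 2.0, 4.0, 5.0]
sample_mean = sum(ys) / len(ys)

# Coarse grid search standing in for arg min over theta.
grid = [i / 100.0 for i in range(0, 601)]
theta_star = min(grid, key=lambda theta: mle_cost(theta, ys))
print(theta_star, sample_mean)
```

The grid search is only there to keep the example dependency-free; the rest of the talk is about doing this search efficiently with gradients.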
The global optimum

[Figure: the global optimum, plotted against $\theta$.]
Taylor approximation

Let's talk about a powerful calculus tool called Taylor approximation.

► Taylor approximation is based on Taylor's theorem¹:

$$h(\theta) = f(\theta_0) + \underbrace{\nabla f(\theta_0)(\theta - \theta_0)}_{\text{first-order}} + \underbrace{\tfrac{1}{2} \nabla^2 f(\theta_0)(\theta - \theta_0)^2}_{\text{second-order}},$$

where we want an approximation of the function around the point $\theta_0$;

► This theorem is very powerful, as it allows us to approximate any
differentiable (and twice differentiable) function;

► $\nabla^2 f(\cdot)$ is also called the Hessian, or $H_f$. We will talk more
about it later;

► We will understand the deep connection of this approximation with
Gradient Descent.

¹ Taylor's theorem gives an approximation of a $k$-times differentiable function around
a given point by a polynomial of degree $k$. We're using only up to second-order here.
Taylor approximation

[Figure sequence: animation frames illustrating the Taylor approximation; horizontal axis $\theta$.]
Taylor approximation in JAX

from jax import grad

# f is the scalar function being approximated; it is assumed to be defined elsewhere.

def taylor_first_order(θ, θ0):
    # f(θ0) plus the slope at θ0 times the displacement
    return f(θ0) + grad(f)(θ0) * (θ - θ0)

def taylor_second_order(θ, θ0):
    d1 = taylor_first_order(θ, θ0)
    # curvature correction: half the second derivative times the squared displacement
    d2 = 1. / 2. * grad(grad(f))(θ0) * (θ - θ0)**2
    return d1 + d2


>>> taylor_first_order(6.01, 6.0)
33.421864

>>> taylor_second_order(6.01, 6.0)
33.422104

>>> taylor_first_order(6.5, 6.0)
44.0067

>>> taylor_second_order(6.5, 6.0)
44.60597


Do not use Greek symbols in your Python code; your colleagues will curse you.
Linear approximation plane

[Figure: tangent plane to a surface at a point.]

Source: Tangent Planes and Linear Approximations. Calculus Volume 3.
Rice University. 2020. Creative Commons Attribution 4.0 International License.
Local approximation and second-order

► Let's now think about that second-order term:

$$h(\theta) = f(\theta_0) + \underbrace{\nabla f(\theta_0)(\theta - \theta_0)}_{\text{first-order}} + \underbrace{\tfrac{1}{2} \nabla^2 f(\theta_0)(\theta - \theta_0)^2}_{\text{second-order}},$$

► If we take a small step from $\theta_0$, what happens to the second-order term?
The steepest descent

► Even if $f(\cdot)$ is very complex, locally it is simple, and we can use a
simple function to approximate it, a linear function:

$$h(\theta) = f(\theta_0) + \underbrace{\nabla f(\theta_0)(\theta - \theta_0)}_{\text{first-order}}$$

► This is also called linearization;

► It is already apparent what we need now. How can we guarantee,
locally, that we can always reduce the function (reduce the loss)?

► We can just follow the negative slope of the approximation, which is
given by $-\nabla f(\theta_0)$;

► No twice-differentiability requirement, and fewer computational resources than the second-order approximation;
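
The argument above can be checked numerically. The sketch below uses a hypothetical polynomial `f` with a hand-written derivative (both are assumptions for illustration): a small step along the negative slope lowers `f`, and the linearization stays close to the true value at the new point.

```python
def f(theta):
    """Hypothetical smooth function used only for illustration."""
    return theta ** 4 - 3.0 * theta ** 2 + theta


def grad_f(theta):
    """Analytic derivative of f."""
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0


def linearization(theta, theta0):
    """First-order Taylor approximation h(theta) around theta0."""
    return f(theta0) + grad_f(theta0) * (theta - theta0)


theta0 = 2.0
step = 1e-3  # small enough for the linear model to be trustworthy

# Follow the negative slope of the approximation.
theta_new = theta0 - step * grad_f(theta0)

print(f(theta0), f(theta_new), linearization(theta_new, theta0))
```

With a much larger `step` the linear model stops being a good description of `f`, which is exactly the learning-rate question that comes next.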
Section II

Gradient Descent
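Gradient Descent

Algorithm: The general gradient descent algorithm.

Input: initial weights $\theta^{(0)}$, number of iterations $T$, learning rate $\eta$
Output: final weights $\theta^{(T)}$

1. for $t = 0$ to $T - 1$
2.     compute $\nabla L(\theta^{(t)})$
3.     $\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \nabla L(\theta^{(t)})$
4. return $\theta^{(T)}$

The important part here is the iterative rule:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \underbrace{\eta \nabla L(\theta^{(t)})}_{\text{How much we move}}$$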
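
The algorithm translates almost line by line into code. This is a minimal sketch in plain Python; the quadratic loss and its gradient are hypothetical stand-ins for a real model's loss.

```python
def gradient_descent(grad_loss, theta_init, num_iterations, learning_rate):
    """Run T steps of theta <- theta - eta * grad L(theta)."""
    theta = theta_init
    for _ in range(num_iterations):
        gradient = grad_loss(theta)               # step 2: compute the gradient
        theta = theta - learning_rate * gradient  # step 3: the iterative rule
    return theta


def grad_loss(theta):
    """Gradient of the hypothetical loss L(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(
    grad_loss, theta_init=0.0, num_iterations=100, learning_rate=0.1
)
print(theta_final)
```

On this loss every step shrinks the distance to the minimizer `3.0` by a constant factor, so the iterates approach it geometrically.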
Gradient Descent - Loss surface

[Figure sequence: gradient descent steps on a loss surface.]
Learning Rate

[Figure sequence: gradient descent on $g(w)$ as the step size $\alpha$ increases from 0.00 to 1.50.]

Source: Machine Learning Refined. Jeremy Watt and Reza Borhani. 2020. Creative Commons
Attribution 4.0 International License. Used with permission from the authors.
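
The effect of the step size can be reproduced on a toy problem. The sketch below assumes $g(w) = w^2$, which is not necessarily the function plotted in the figures, and runs gradient descent with three hypothetical values of $\alpha$.

```python
def run_gradient_descent(alpha, w_init=1.0, num_steps=20):
    """Return the iterates of gradient descent on g(w) = w**2."""
    w = w_init
    history = [w]
    for _ in range(num_steps):
        w = w - alpha * 2.0 * w  # the gradient of w**2 is 2*w
        history.append(w)
    return history


small = run_gradient_descent(alpha=0.1)      # steady, monotone convergence
large = run_gradient_descent(alpha=0.9)      # overshoots, but still converges
too_large = run_gradient_descent(alpha=1.1)  # overshoots more each step

print(small[-1], large[-1], too_large[-1])
```

On this function each update multiplies `w` by `1 - 2 * alpha`, so the iterates shrink only while that factor has magnitude below one.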











High curvatures 

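Gradient descent can struggle with some pathological curvatures, which cause a lot
of oscillations:

[Figure: gradient descent trajectory oscillating on a high-curvature loss; axes $\theta_0$ and $\theta_1$.]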



Source: Code adapted from Machine Learning Refined. Jeremy Watt and Reza Borhani. 2020. 
Creative Commons Attribution 4.0 International License. 
Momentum

Momentum is a method to damp out oscillations:

Vanilla gradient descent:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \nabla L(\theta^{(t)})$$

Momentum:

$$u^{(t+1)} = \underbrace{\beta}_{\text{Constant}} u^{(t)} + \nabla L(\theta^{(t)})$$

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \underbrace{u^{(t+1)}}_{\text{Momentum buffer}}$$

► Momentum works by acceleration and smoothing: it makes the
trajectories take more time to react to changes in the loss landscape;

► Note that with $\beta = 0$ we recover vanilla gradient descent;
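
The two update rules can be sketched side by side. The elongated quadratic loss, the starting point and the hyperparameters below are assumptions for illustration; the point is only to show the momentum buffer and to check that $\beta = 0$ reproduces the vanilla path.

```python
def grad_loss(theta):
    """Gradient of the hypothetical loss 0.5 * (theta0**2 + 25 * theta1**2)."""
    return (theta[0], 25.0 * theta[1])


def vanilla_descent(theta_init, learning_rate, num_steps):
    theta = theta_init
    path = [theta]
    for _ in range(num_steps):
        g = grad_loss(theta)
        theta = tuple(t_k - learning_rate * g_k for t_k, g_k in zip(theta, g))
        path.append(theta)
    return path


def momentum_descent(theta_init, learning_rate, beta, num_steps):
    theta = theta_init
    u = (0.0, 0.0)  # momentum buffer, starts empty
    path = [theta]
    for _ in range(num_steps):
        g = grad_loss(theta)
        # u <- beta * u + grad L(theta)
        u = tuple(beta * u_k + g_k for u_k, g_k in zip(u, g))
        # theta <- theta - eta * u
        theta = tuple(t_k - learning_rate * u_k for t_k, u_k in zip(theta, u))
        path.append(theta)
    return path


start = (10.0, 1.0)
vanilla_path = vanilla_descent(start, learning_rate=0.07, num_steps=100)
no_momentum_path = momentum_descent(start, learning_rate=0.07, beta=0.0, num_steps=100)
momentum_path = momentum_descent(start, learning_rate=0.07, beta=0.5, num_steps=100)
print(vanilla_path[-1], momentum_path[-1])
```

Because the buffer accumulates past gradients, a larger `beta` effectively lengthens the steps, so in practice `beta` and the learning rate are tuned together.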
Momentum

[Figure sequence: gradient descent trajectories with momentum for $\beta = 0.0$, $\beta = 0.1$ and $\beta = 0.7$; axes $\theta_0$ and $\theta_1$.]

Source: Code adapted from Machine Learning Refined. Jeremy Watt et al. 2020.
Momentum


Pause for a quick demo from Lili Jiang, from: 

https://github.com/lilipads/gradient_descent_viz 
Stochastic Gradient Descent (SGD)

It turns out that we don't need to compute the gradient $\nabla L(\theta)$ over
the whole dataset at every iteration of gradient descent:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \underbrace{\nabla L_i(\theta^{(t)})}_{\text{Individual samples}}$$

where we randomly sample an individual sample $i$ at every step (or not: we can
stratify too, which in practice can lead to better results).

► Much more efficient (we don't have to compute the gradient for the entire
dataset);

► Noise (can be beneficial);

► Lots of redundancy in real datasets;

► High correlation at early steps (similar gradients for SGD and GD);

SGD can be traced back to 1950s work on the Robbins-Monro algorithm¹.

¹ Robbins and Monro, "A Stochastic Approximation Method", 1951
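
A minimal sketch of the SGD update in plain Python follows. The one-parameter model, the noiseless toy data and the hyperparameters are assumptions for illustration; each step uses the gradient of the loss on a single randomly drawn sample.

```python
import random


def sgd(xs, ys, theta_init, learning_rate, num_steps, seed=0):
    """Fit y = theta * x using one randomly drawn sample per step."""
    rng = random.Random(seed)
    theta = theta_init
    for _ in range(num_steps):
        i = rng.randrange(len(xs))  # sample an individual example
        # Gradient of the per-sample loss 0.5 * (theta * x_i - y_i)**2.
        grad_i = (theta * xs[i] - ys[i]) * xs[i]
        theta = theta - learning_rate * grad_i
    return theta


# Hypothetical noiseless data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

theta_hat = sgd(xs, ys, theta_init=0.0, learning_rate=0.02, num_steps=2000)
print(theta_hat)
```

With noisy labels the iterates would keep jittering around the minimizer instead of settling exactly, which is the noise mentioned in the bullets above.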
Graphics Processing Units (GPUs)

Most of the operations in Machine Learning end up being lowered to
GEMM (General Matrix Multiplication) and MAC
(Multiply-Accumulate) operations.

To leverage these massively parallel engines, we need to provide enough data
to take advantage of their parallelization potential.

[Figure: GPU memory hierarchy. Each thread block (Thread Block 0, Thread Block 1) has its own shared memory and threads 0 to n, each with local memory; all blocks access global memory.]

Source: Standard GPU memory hierarchy. By Giacomo Parigi.
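Mini-batch SGD

That's why using mini-batches instead of individual samples in SGD is a
perfect marriage of better gradient estimates and improved
parallelization:

$$\widehat{\nabla L}(\theta^{(t)}) = \frac{1}{\underbrace{|B|}_{\text{Batch size}}} \sum_{i \in B} \nabla L_i(\theta^{(t)})$$

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \underbrace{\widehat{\nabla L}(\theta^{(t)})}_{\text{Estimated gradients}}$$

If we do random sampling, then:

$$\underbrace{\mathbb{E}\big[\widehat{\nabla L}(\theta^{(t)})\big] = \nabla L(\theta^{(t)})}_{\text{Unbiased estimate}}$$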
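
The unbiasedness claim can be verified exactly on a tiny dataset. The sketch below assumes batches drawn uniformly without replacement and a hypothetical one-parameter least-squares loss; it enumerates every possible batch of a given size and averages the mini-batch gradient estimates.

```python
from itertools import combinations


def sample_gradient(theta, x, y):
    """Gradient of the per-sample loss 0.5 * (theta * x - y)**2."""
    return (theta * x - y) * x


def full_gradient(theta, xs, ys):
    """Gradient averaged over the whole dataset."""
    return sum(sample_gradient(theta, x, y) for x, y in zip(xs, ys)) / len(xs)


def minibatch_gradient(theta, xs, ys, batch):
    """Gradient estimate averaged over the indices in one mini-batch."""
    return sum(sample_gradient(theta, xs[i], ys[i]) for i in batch) / len(batch)


# Hypothetical toy data and parameter value.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 4.5, 5.0, 9.0]
theta = 0.5
batch_size = 2

# Every batch is equally likely under uniform sampling, so the expectation
# is the plain average over all possible batches.
all_batches = list(combinations(range(len(xs)), batch_size))
expected_estimate = sum(
    minibatch_gradient(theta, xs, ys, batch) for batch in all_batches
) / len(all_batches)

print(expected_estimate, full_gradient(theta, xs, ys))
```

Individual batches still disagree with the full gradient; only their average matches it, and that spread shrinks as the batch size grows.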







Section III

Adaptation and Preconditioning
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Adaptive Moment Estimation (Adam) 

There are many adaptive methods, we will focus on one of the most 
frequently used in Deep Learning, Adaptive Moment Estimation also 

called Adam. 

► Single learning rate for all parameters of the network doesn’t seem to 
be enough to cope with the growing complexity of Deep Learning 
architectures; 

► What we can do is often bounded by what we can optimize, therefore 
better optimization techniques that explores structure are paramount; 

► Most of the adaptive methods adapt to some kind of structure or 
curvature of the optimization landscape; 

► Many of these algorithms are still not well understood, lots of folklore 
in the field; 

► Will try to focus on building intuition from the original algorithm. 


Kingma and Ba, “Adam: a Method for Stochastic Optimization ”, 2015 

Adaptive Moment Estimation (Adam)

Algorithm: $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good defaults: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and
$\epsilon = 10^{-8}$. $\beta_1^t$ and $\beta_2^t$ are $\beta_1$ and $\beta_2$ to the power $t$.

Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector, $\alpha$: Stepsize
$m_0 \leftarrow 0$ (Initialize first moment vector)
$v_0 \leftarrow 0$ (Initialize second moment vector)
$t \leftarrow 0$ (Initialize timestep)
while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
end while
return $\theta_t$ (Resulting parameters)
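
The algorithm above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function name `adam`, the fixed number of steps (used instead of a convergence test) and the toy quadratic objective are assumptions made for the example.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run a fixed number of Adam steps (assumption: no convergence test)."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # biased first moment estimate
    v = np.zeros_like(theta)  # biased second raw moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)                       # gradient at theta_{t-1}
        m = beta1 * m + (1 - beta1) * g          # update first moment
        v = beta2 * v + (1 - beta2) * g**2       # update second raw moment (elementwise square)
        m_hat = m / (1 - beta1**t)               # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Hypothetical toy objective: f(theta) = 2.5 * theta_1^2 + 1.0 * theta_2^2
def toy_grad(theta):
    return np.array([5.0, 2.0]) * theta

theta_one_step = adam(toy_grad, [0.5, 0.5], alpha=0.01, steps=1)
theta_final = adam(toy_grad, [0.5, 0.5], alpha=0.01, steps=2000)
print(theta_one_step, theta_final)
```

Note that after bias correction the very first step has magnitude close to $\alpha$ in every coordinate, regardless of the gradient scale.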









Adaptive Moment Estimation (Adam)

Lots of things going on here, so let’s focus on how the moments are
computed and neglect bias correction and initialization:

$$g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$$
$$m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
$$v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

And the parameter updates:

$$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$$

► Do you recognize $m_t$?

► What happens when the uncentered variance grows?

The good, the bad, and the Hessian

► The convergence rate of gradient descent is deeply connected to the
curvature of the landscape it is trying to optimize;

► The Hessian matrix $H_f$ carries information about the curvature,
therefore we usually use it to understand problems or even to make them
better conditioned;

► The $H_f$ is often very costly to compute for real-life problems,
therefore much of the work relies on approximating it or computing
information about it without having to materialize the entire matrix;

Hessian

The $H_f$ is a square matrix of second-order partial derivatives. Let’s compute
the $H_f$ of $f(x, y) = x^2 y + x y^3$, starting with the first-order derivatives:

$$\frac{\partial f}{\partial x} = 2xy + y^3, \qquad \frac{\partial f}{\partial y} = x^2 + 3xy^2$$

Second order:

$$\frac{\partial^2 f}{\partial x^2} = 2y, \quad \frac{\partial^2 f}{\partial y \partial x} = 2x + 3y^2, \quad \frac{\partial^2 f}{\partial x \partial y} = 2x + 3y^2, \quad \frac{\partial^2 f}{\partial y^2} = 6xy$$

Hessian:

$$H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial y \partial x} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2y & 2x + 3y^2 \\ 2x + 3y^2 & 6xy \end{bmatrix}$$

Note that the $H_f$ can be constant and not depend on the variables, or depend only on
some of them. We will see this case later.
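
As a sanity check on the derivation, we can compare the analytic Hessian of $f(x, y) = x^2 y + x y^3$ against a central finite-difference estimate. This is only a sketch: the helper names, the step size `h` and the evaluation point $(1, 2)$ are arbitrary choices for illustration.

```python
import numpy as np

def f(x, y):
    return x**2 * y + x * y**3

def analytic_hessian(x, y):
    # Entries derived by hand: f_xx = 2y, f_xy = f_yx = 2x + 3y^2, f_yy = 6xy
    return np.array([[2 * y, 2 * x + 3 * y**2],
                     [2 * x + 3 * y**2, 6 * x * y]])

def numeric_hessian(x, y, h=1e-4):
    # Central finite differences on every pair of coordinates
    p = np.array([x, y], dtype=float)
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei = np.zeros(2)
            ej = np.zeros(2)
            ei[i] = h
            ej[j] = h
            H[i, j] = (f(*(p + ei + ej)) - f(*(p + ei - ej))
                       - f(*(p - ei + ej)) + f(*(p - ei - ej))) / (4 * h**2)
    return H

H_exact = analytic_hessian(1.0, 2.0)
H_approx = numeric_hessian(1.0, 2.0)
eigenvalues = np.linalg.eigvalsh(H_exact)  # ascending order
print(H_exact, H_approx, eigenvalues, sep="\n")
```

At this particular point the two eigenvalues have opposite signs, so the Hessian is indefinite there.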

Hessian Eigenvalues

[Figure: two panels: all positive eigenvalues (positive definite); all negative eigenvalues (negative definite)]

Condition Number

The condition number, denoted $\kappa$, is the ratio of the maximum and
minimum eigenvalues ($\lambda_{max}$ and $\lambda_{min}$) of the Hessian $H_f$:

$$\kappa = \frac{\lambda_{max}}{\lambda_{min}}$$

► When $\kappa$ is high we say that the problem is ill-conditioned;

► The convergence rate of steepest descent is slow for ill-conditioned problems;

► Let’s understand it on a quadratic problem to gain intuition.
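
To make the effect of $\kappa$ concrete, here is a small sketch that runs plain gradient descent on a diagonal quadratic $f(\theta) = \frac{\lambda_1}{2}\theta_1^2 + \frac{\lambda_2}{2}\theta_2^2$ and counts the iterations needed to reach a tolerance. The step size $1/\lambda_{max}$, the starting point and the tolerance are assumptions chosen for the illustration, not something prescribed by the talk.

```python
import numpy as np

def gd_iterations(lams, theta0=(1.0, 1.0), tol=1e-6, max_iter=100_000):
    """Count gradient descent steps until ||theta|| < tol on a diagonal quadratic."""
    lams = np.asarray(lams, dtype=float)
    theta = np.asarray(theta0, dtype=float).copy()
    eta = 1.0 / lams.max()  # assumption: step size 1 / lambda_max
    for k in range(max_iter):
        if np.linalg.norm(theta) < tol:
            return k
        theta = theta - eta * lams * theta  # gradient of the quadratic is lams * theta
    return max_iter

for lams in [(2.0, 2.0), (5.0, 2.0), (100.0, 1.0)]:
    kappa = max(lams) / min(lams)
    print(f"kappa = {kappa:6.2f}: {gd_iterations(lams)} iterations")
```

With this step size the direction of $\lambda_{max}$ is solved in one step, while the direction of $\lambda_{min}$ contracts by a factor of $1 - 1/\kappa$ per iteration, so the iteration count grows with $\kappa$.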

Condition Number

$$f(\theta) = \frac{\lambda_1}{2}\theta_1^2 + \frac{\lambda_2}{2}\theta_2^2, \qquad H_f = \begin{bmatrix} \lambda_1 & 0.0 \\ 0.0 & \lambda_2 \end{bmatrix}$$

[Animation: contour plot of $f$ for changing eigenvalues; frame captions below]

$\kappa = 2.00$ ($\lambda_{max} = 2.0$, $\lambda_{min} = 1.0$)
$\kappa = 1.33$ ($\lambda_{max} = 2.0$, $\lambda_{min} = 1.5$)
$\kappa = 1.00$ ($\lambda_{max} = 2.0$, $\lambda_{min} = 2.0$)
$\kappa = 1.25$ ($\lambda_{max} = 2.5$, $\lambda_{min} = 2.0$)
$\kappa = 1.50$ ($\lambda_{max} = 3.0$, $\lambda_{min} = 2.0$)
$\kappa = 1.75$ ($\lambda_{max} = 3.5$, $\lambda_{min} = 2.0$)
$\kappa = 2.00$ ($\lambda_{max} = 4.0$, $\lambda_{min} = 2.0$)
$\kappa = 2.25$ ($\lambda_{max} = 4.5$, $\lambda_{min} = 2.0$)
$\kappa = 2.50$ ($\lambda_{max} = 5.0$, $\lambda_{min} = 2.0$)
$\kappa = 2.75$ ($\lambda_{max} = 5.5$, $\lambda_{min} = 2.0$)

Hessian eigenvalue spectral density (ESD)

[Figure: three panels, x-axis: Epoch]

Source: Yao, Z., Gholami, A., Keutzer, K., & Mahoney, M. W. (2019, December). PyHessian:
Neural networks through the lens of the Hessian.

ResNet with depth 20 trained on CIFAR-10. ResNet_BN is the ResNet
without Batch Normalization and ResNet_Res is the one without the residual
connections. Sagun et al. also show that the distribution seems to be composed of
two parts: the bulk around zero, and the edges scattered away from zero.

Sagun, Bottou, and LeCun, “Eigenvalues of the Hessian in Deep Learning: Singularity and
Beyond”, 2016

Preconditioning

From Adam’s original paper:

(...) Like natural gradient descent (NGD), Adam employs a preconditioner
that adapts to the geometry of the data, since $\hat{v}_t$ is an approximation to the
diagonal of the Fisher information matrix (...)

► Preconditioning can be viewed as a change in the geometry;

► It can help with poorly conditioned problems;

► We will talk about the Fisher Information Matrix (FIM) later;


Amari, “Natural Gradient Works Efficiently in Learning”, 1998
Pascanu and Bengio, “Revisiting Natural Gradient for Deep Networks”, 2013

Preconditioning

$$\theta^{(t+1)} = \theta^{(t)} - \eta \underbrace{\nabla L(\theta^{(t)})}_{\text{Gradients}}$$

$$\theta^{(t+1)} = \theta^{(t)} - \eta \underbrace{\Big[\;\cdot\;\Big]}_{\text{Preconditioner}} \underbrace{\nabla L(\theta^{(t)})}_{\text{Gradients}}$$

$$\theta^{(t+1)} = \theta^{(t)} - \eta \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{Identity}} \underbrace{\nabla L(\theta^{(t)})}_{\text{Gradients}}$$

$$\theta^{(t+1)} = \theta^{(t)} - \eta \underbrace{H_L^{-1}}_{\text{Hessian}} \underbrace{\nabla L(\theta^{(t)})}_{\text{Gradients}}$$

► Can be interpreted as an iterative minimization of the quadratic
approximation. We’re using a second-order term here; remember the
Taylor approximation?

$$\theta^{(t+1)} = \theta^{(t)} - \eta \underbrace{(H_L + \mu I)^{-1}}_{\text{Damped Hessian}} \underbrace{\nabla L(\theta^{(t)})}_{\text{Gradients}}$$


The superscript $t$ was omitted from the $H_L$ for clarity.

Preconditioning

$$f(\theta) = \frac{5.0}{2}\theta_1^2 + \frac{2.0}{2}\theta_2^2$$

$\kappa = 2.50$ ($\lambda_{max} = 5.0$, $\lambda_{min} = 2.0$)

Let’s think about what the preconditioner is doing in this
situation. We have a point $\theta \in \mathbb{R}^2$ at $\theta = (0.5, 0.5)$ and
we have that:

$$\nabla f(\theta) = (2.5, 1.0)$$

$$H_f = \begin{bmatrix} 5.0 & 0.0 \\ 0.0 & 2.0 \end{bmatrix}$$

$$\theta - \nabla f(\theta) = (-2.0, -0.5)$$
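
The numbers in this example can be reproduced with a short sketch that applies one update from $\theta = (0.5, 0.5)$ under three different curvature matrices: the identity (plain gradient descent), the Hessian (a Newton step) and a damped Hessian. The helper name `preconditioned_step`, the step size $\eta = 1$ and the damping value $1.0$ are assumptions made for the illustration.

```python
import numpy as np

H = np.array([[5.0, 0.0],
              [0.0, 2.0]])       # Hessian of the quadratic in this example
theta = np.array([0.5, 0.5])
grad = H @ theta                 # gradient of the quadratic: (2.5, 1.0)

def preconditioned_step(theta, grad, C, eta=1.0):
    """One update theta - eta * C^{-1} grad, solving C d = grad instead of inverting C."""
    return theta - eta * np.linalg.solve(C, grad)

gd_step = preconditioned_step(theta, grad, np.eye(2))        # identity: plain gradient step
newton_step = preconditioned_step(theta, grad, H)            # Hessian: Newton step
damped_step = preconditioned_step(theta, grad, H + 1.0 * np.eye(2))  # damping value is arbitrary

print(gd_step, newton_step, damped_step)
```

On this quadratic the plain gradient step with $\eta = 1$ overshoots the minimum at the origin, while the Hessian-preconditioned step rescales each coordinate by its own curvature and lands on it directly.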

Hessian as preconditioner

$$f(\theta) = \frac{\lambda_1}{2}\theta_1^2 + \frac{\lambda_2}{2}\theta_2^2, \qquad H_f = \begin{bmatrix} \lambda_1 & 0.0 \\ 0.0 & \lambda_2 \end{bmatrix}$$

[Animation: contour plot of $f$ for changing eigenvalues; frame captions below]

$\kappa = 2.00$ ($\lambda_{max} = 2.0$, $\lambda_{min} = 1.0$)
$\kappa = 1.33$ ($\lambda_{max} = 2.0$, $\lambda_{min} = 1.5$)
$\kappa = 1.00$ ($\lambda_{max} = 2.0$, $\lambda_{min} = 2.0$)
$\kappa = 1.25$ ($\lambda_{max} = 2.5$, $\lambda_{min} = 2.0$)
$\kappa = 1.50$ ($\lambda_{max} = 3.0$, $\lambda_{min} = 2.0$)
$\kappa = 1.75$ ($\lambda_{max} = 3.5$, $\lambda_{min} = 2.0$)
$\kappa = 2.00$ ($\lambda_{max} = 4.0$, $\lambda_{min} = 2.0$)
$\kappa = 2.25$ ($\lambda_{max} = 4.5$, $\lambda_{min} = 2.0$)
$\kappa = 2.50$ ($\lambda_{max} = 5.0$, $\lambda_{min} = 2.0$)

Difficulties of the Hessian preconditioning

► Using the Hessian as a preconditioner is the basis of Newton’s
method;

► Invariant to affine transformations;

► However, for a model with 23 million parameters (e.g. ResNet-50), what
is the space complexity to store the $H_f$ and the computational
complexity to invert it?

► Difficult on non-convex problems: not always invertible, attracted by
saddle points;

► Among other reasons, you now understand all the effort put into
Hessian approximations, alternative curvature matrices, and
Hessian-free optimization.

Dauphin et al., “Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization”, 2014
Yao et al., PyHessian: Neural networks through the lens of the Hessian, 2019
Martens, Deep learning via Hessian-free optimization, 2010
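
To put rough numbers on the storage question for a 23-million-parameter model, here is a back-of-the-envelope sketch. It assumes a dense Hessian stored in 32-bit floats and a dense inversion whose cost grows cubically with the number of parameters; both are simplifying assumptions for illustration.

```python
n_params = 23_000_000        # parameter count quoted in the text (ResNet-50)
bytes_per_float32 = 4        # assumption: single-precision storage

hessian_entries = n_params ** 2                      # dense n x n matrix
hessian_bytes = hessian_entries * bytes_per_float32
hessian_petabytes = hessian_bytes / 1e15

inversion_cost_order = n_params ** 3                 # assumption: O(n^3) dense inversion

print(f"entries:   {hessian_entries:.3e}")
print(f"storage:   {hessian_petabytes:.3f} PB")
print(f"inversion: on the order of {inversion_cost_order:.3e} operations")
```

Storage grows quadratically and inversion roughly cubically in the number of parameters, which is why materializing the full matrix is out of reach at this scale.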
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Saddle points 

Fisher Information Matrix (FIM)

Going back to Adam’s original paper:

(...) Like natural gradient descent (NGD), Adam employs a preconditioner
that adapts to the geometry of the data, since $\hat{v}_t$ is an approximation to the
diagonal of the Fisher information matrix (...)

► We now know what a preconditioner means;

► The missing ingredient now is the Fisher Information Matrix (also
known as FIM).


Amari, “Natural Gradient Works Efficiently in Learning”, 1998
Pascanu and Bengio, “Revisiting Natural Gradient for Deep Networks”, 2013

Fisher Information Matrix (FIM)

The Fisher Information Matrix is the covariance of the score function
(gradients of the log-likelihood function) with expectation over the model’s
predictive distribution (pay attention to this detail).

Definition: Fisher Information Matrix

$$F_\theta = \mathbb{E}_{\substack{x \sim p_{data} \\ y \sim p_\theta(y|x)}} \left[ \nabla_\theta \log p_\theta(y|x) \, \nabla_\theta \log p_\theta(y|x)^T \right]$$

Where $F_\theta \in \mathbb{R}^{n \times n}$. We often approximate it using input samples ($y$ is still
from the model’s predictive distribution), as we don’t have access to $p_{data}$:

$$F_\theta \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{y \sim p_\theta(y|x_i)} \left[ \nabla_\theta \log p_\theta(y|x_i) \, \nabla_\theta \log p_\theta(y|x_i)^T \right]$$
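
As a hedged illustration of this definition, the sketch below estimates the FIM of a small logistic regression model by sampling $y$ from the model’s own predictive distribution, and compares it with the closed form available for this model. The model choice, the function names and the random inputs are assumptions for the example; for logistic regression the score is $(y - p)\,x$ with $p = \sigma(x^T\theta)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_mc(theta, X, n_samples=500, seed=0):
    """Monte Carlo FIM estimate: inputs from the data, y sampled from the model."""
    rng = np.random.default_rng(seed)
    p = sigmoid(X @ theta)
    F = np.zeros((theta.size, theta.size))
    for _ in range(n_samples):
        y = rng.binomial(1, p)                 # y ~ p_theta(y | x), NOT the dataset labels
        scores = (y - p)[:, None] * X          # per-example score vectors
        F += scores.T @ scores / X.shape[0]    # average of outer products over inputs
    return F / n_samples

def fisher_exact(theta, X):
    """Closed form for logistic regression: average of p (1 - p) x x^T over inputs."""
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                   # hypothetical input samples
theta = np.array([0.5, -1.0, 0.25])            # hypothetical parameters

F_mc = fisher_mc(theta, X)
F_ex = fisher_exact(theta, X)
print(np.round(F_mc, 3), np.round(F_ex, 3), sep="\n")
```

No labels appear anywhere in this computation: the expectation over $y$ is taken under the model, which is exactly the detail the definition asks us to pay attention to.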

Kullback-Leibler divergence

[Animation frames; $KL[P \,\|\, Q]$ per frame: 5683.243, 3488.456, 1842.365, 744.971, 196.274, 196.274, 744.971, 1842.365, 3488.456, 5683.243]

Fisher Information Matrix (FIM)

► We can parametrize the same distribution family in many different
ways;

► Moving in the parameter space using the Euclidean distance as a
metric ties us to the particular parametrization;

► One interesting alternative is to move in the distribution space, for which
the KL divergence makes more sense;

► It turns out that the second-order Taylor approximation to the KL
divergence is given by the Fisher Information Matrix;

► We won’t cover it here, but the Fisher has a strong connection to
the Hessian and the Generalized Gauss-Newton (GGN); please refer
to Martens (2014) if you are interested.


For a full derivation please refer to: Ratliff, N. (2013). Information Geometry and
Natural Gradients.
Martens, “New insights and perspectives on the natural gradient method”, 2014

Section IV


Natural Gradient

Natural Gradient

When we precondition gradient descent using the Fisher, we
have Natural Gradient Descent:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \underbrace{F_\theta^{-1}}_{\text{FIM}} \underbrace{\nabla L(\theta^{(t)})}_{\text{Gradients}}$$

► It can converge much faster than ordinary gradient descent;

► It moves on the distribution space manifold, invariant with respect to
all differentiable and invertible transformations;

► Given that the FIM is the result of an outer product, it is always PSD
(positive semidefinite);

► It is still an $n \times n$ matrix that needs to be inverted;


Amari, “Natural Gradient Works Efficiently in Learning”, 1998
Bottou, Curtis, and Nocedal, Optimization methods for large-scale machine learning, 2018

Natural Gradient

The natural gradient is connected to information geometry:

► In a Euclidean space, the shortest path between two points is always
the straight line;

► In a Riemannian space, the shortest path between two points
(minimal geodesic) can have a curvature, and sometimes there is more
than one between two points;

► The metric tensor represents this curvature and can be different at
different points;

► With the natural gradient, we are moving in this Riemannian
manifold using the Fisher as the metric tensor;

► Parameters move more quickly along directions that have a small
impact on the decision function, and more cautiously along directions
that have a large impact.

Amari, “Information geometry and its applications”, 2016
Bottou, Curtis, and Nocedal, Optimization methods for large-scale machine learning, 2018

► A manifold is a collection of
points that is locally (but not
globally) Euclidean;

► A metric induces an inner
product on the tangent space at
each point on the manifold;

► The metric on the statistical
manifold is unique; it is an
intrinsic geometry;

► In Euclidean space we don’t
care because the metric is
constant everywhere;

Empirical Fisher

There is a lot of confusion about the Fisher Information Matrix:

► In some scenarios you will see people sampling $y \sim p_{data}$ too, instead
of sampling from the model’s predictive distribution $y \sim p_\theta(y|x)$;

► This is called the Empirical Fisher, Empirical FIM or just EF:

$$\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(y_i|x_i) \, \nabla_\theta \log p_\theta(y_i|x_i)^T$$

► It turns out that Adam is using the Empirical Fisher, and to make
things more confusing it is using the square root of it.


I blame evil people who omit expectation qualifiers about where y is coming from.
Kunstner, Balles, and Hennig, “Limitations of the Empirical Fisher Approximation for Natural
Gradient Descent”, 2019
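
The difference between the two quantities can be seen on a toy logistic regression. The sketch below computes the Fisher (expectation over the model’s predictions) and the Empirical Fisher (using the dataset labels) for the same parameters, once with labels that are consistent with the model and once with flipped labels that the model fits badly. Everything here (model, data, function names) is a hypothetical setup for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher(theta, X):
    """Fisher for logistic regression: expectation over y ~ p_theta(y | x)."""
    p = sigmoid(X @ theta)
    return (X * (p * (1 - p))[:, None]).T @ X / X.shape[0]

def empirical_fisher(theta, X, y):
    """Empirical Fisher: outer products of scores evaluated at the dataset labels y."""
    p = sigmoid(X @ theta)
    scores = (y - p)[:, None] * X
    return scores.T @ scores / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                      # hypothetical inputs
theta = np.array([1.0, -0.5])                      # hypothetical parameters
y_model = rng.binomial(1, sigmoid(X @ theta))      # labels consistent with the model
y_flipped = 1 - y_model                            # labels the model fits badly

gap_consistent = np.linalg.norm(empirical_fisher(theta, X, y_model) - fisher(theta, X))
gap_flipped = np.linalg.norm(empirical_fisher(theta, X, y_flipped) - fisher(theta, X))
print(gap_consistent, gap_flipped)
```

When the labels look like samples from the model the two matrices are close, but when the model disagrees with the labels the Empirical Fisher drifts away from the Fisher, even though the parameters and inputs are identical.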
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Adam and the Natural Gradient Descent 

Original Adam paper claims that Adam is an approximation to the 
natural gradient descent (diagonal of the FIM): 


9t 

rtit 

vt 


Ot 


Ve/t(0t-i) 

Pi ■ mt-i + (1 - Pi) ■ gt 


p 2 ■ Vt-i + (1 - P 2 ) ■ 9t 


it-i - a ■ 


mt 




+ e 


Kingma and Ba, “Adam: a Method for Stochastic Optimization ”, 2015 
^^Staib et al., “Escaping saddle points with adaptive gradient methods”, 2019 

Adam and the Natural Gradient Descent

The original Adam paper claims that Adam is an approximation to
natural gradient descent (diagonal of the FIM):

$$
\begin{aligned}
g_t &\leftarrow \nabla_\theta f_t(\theta_{t-1}) \\
m_t &\leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \\
v_t &\leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\
\theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

Here $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates and
$g_t^2$ is the element-wise square.

However, the approximation is only valid near optimality (why?). The
exponent is also different: since Adam takes the square root, it doesn’t
change the direction of the descent (only the step size).

Kingma and Ba, “Adam: a Method for Stochastic Optimization”, 2015
Staib et al., “Escaping saddle points with adaptive gradient methods”, 2019
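
The update rule can be sketched in a few lines of plain Python. This is a minimal illustration on lists of floats, not the reference implementation; the quadratic test function `f`, its gradient `grad_f`, and the step size used in the loop are assumptions chosen for the example.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second (uncentered) moment
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_theta.append(th - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


def f(theta):
    """Hypothetical badly scaled quadratic (assumption, for illustration)."""
    return theta[0] ** 2 + 10.0 * theta[1] ** 2


def grad_f(theta):
    return [2.0 * theta[0], 20.0 * theta[1]]


theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 201):
    theta, m, v = adam_step(theta, grad_f(theta), m, v, t, alpha=0.05)
print(theta, f(theta))
```

On the very first step both coordinates move by roughly `alpha`, even though their gradients differ by a factor of ten: dividing by the square root of the second moment rescales the step size of each coordinate and keeps its sign.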

Empirical Fisher

[Figure: four panels titled Dataset, GD, NGD and EF.]

Source: Kunstner, F., Balles, L., & Hennig, P. Limitations of the Empirical Fisher Approximation for
Natural Gradient Descent, 2019. https://arxiv.org/abs/1905.12558.

► Vector fields of the gradients conditioned using the FIM vs using the
EF are very different;

► Are they close to each other near the minima?

Empirical Fisher

[Figure: three panels titled Correct, Misspecified (A) and Misspecified (B); legend: loss contour, Fisher, emp. Fisher, minimum.]

Source: Kunstner, F., Balles, L., & Hennig, P. Limitations of the Empirical Fisher Approximation for
Natural Gradient Descent, 2019. https://arxiv.org/abs/1905.12558.

► EF is a good approximation of the Fisher at the minimum if the model is
well-specified. Otherwise, even at the minimum and with a large number of
samples, it can be a very poor approximation;

► Is EF just the non-central gradient covariance matrix, working as variance
reduction instead of curvature adaptation?

Kunstner, Balles, and Hennig, “Limitations of the empirical fisher approximation for natural
gradient descent”, 2019

The epsilon that might not be an epsilon

Many implementations use the epsilon to avoid division by zero:

$$
\begin{aligned}
g_t &\leftarrow \nabla_\theta f_t(\theta_{t-1}) \\
m_t &\leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \\
v_t &\leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \\
\theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

However, remember the damping mechanism? The $\epsilon$ can be seen as
setting a trust region radius.

Choi et al., “On empirical comparisons of optimizers for deep learning”, 2019
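
One way to see the damping role of $\epsilon$ is to look at how it changes the relative step of two coordinates. The sketch below is an illustration under assumed values: the moment estimates `m_hat` and `v_hat` are made up, and `coordinate_ratio` is a hypothetical helper. With a tiny $\epsilon$ both coordinates are fully normalized; with a large $\epsilon$ the denominator is dominated by $\epsilon$ and the update behaves like a momentum step scaled by $\alpha / \epsilon$.

```python
import math


def adam_direction(m_hat, v_hat, eps):
    """Per-coordinate Adam direction m_hat / (sqrt(v_hat) + eps), without the step size."""
    return [m / (math.sqrt(v) + eps) for m, v in zip(m_hat, v_hat)]


# Assumed moment estimates: one coordinate with small gradients, one with large ones.
m_hat = [1e-3, 1.0]
v_hat = [1e-6, 1.0]


def coordinate_ratio(eps):
    """Ratio between the step taken on the small-gradient and large-gradient coordinate."""
    d = adam_direction(m_hat, v_hat, eps)
    return d[0] / d[1]


for eps in (1e-8, 1e-3, 1.0, 1e3):
    print(eps, coordinate_ratio(eps))
```

A ratio close to 1 means both coordinates take the same step (full adaptation), while a ratio close to `m_hat[0] / m_hat[1]` means the adaptation has been damped away.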

Fisher is a big Fisher

Computing the inverse of the diagonal Fisher is easy, but computing the
inverse of the “full” Fisher and the natural gradient on
networks with millions of parameters is just intractable.

► What about other structural approximations? We don’t want to lose
all of the off-diagonal structure;

► However, there are certain goals that we should ideally try to
achieve: memory (remember we have $F \in \mathbb{R}^{n \times n}$ for
$n$ parameters) and computation (we want to have an efficient $F^{-1}$);

► That is what Kronecker-Factored Approximate Curvature
(K-FAC) proposes, a structured approximation to natural gradient
descent;

Martens and Grosse, “Optimizing neural networks with Kronecker-factored approximate
curvature”, 2015

Kronecker product

$$
A \otimes B =
\begin{pmatrix}
A_{1,1} B & \cdots & A_{1,n} B \\
\vdots & \ddots & \vdots \\
A_{m,1} B & \cdots & A_{m,n} B
\end{pmatrix}
\in \mathbb{R}^{ma \times nb}
$$

$A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{a \times b}$: Kronecker factors

Source: Kazuki Osawa. Introducing k-fac: A second-order optimization method for large-scale deep
learning, 2018.
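
The block structure of the definition can be sketched directly in plain Python. The function name `kron` and the small example matrices are assumptions for illustration; in practice a library routine would be used instead.

```python
def kron(A, B):
    """Kronecker product of A (m x n) and B (a x b), both given as lists of rows."""
    m, n = len(A), len(A[0])
    a, b = len(B), len(B[0])
    out = [[0] * (n * b) for _ in range(m * a)]
    for i in range(m):
        for j in range(n):
            for k in range(a):
                for l in range(b):
                    # Block (i, j) of the result is A[i][j] * B.
                    out[i * a + k][j * b + l] = A[i][j] * B[k][l]
    return out


A = [[1, 2], [3, 4]]   # 2 x 2
B = [[0, 1, 2]]        # 1 x 3
K = kron(A, B)         # (2 * 1) x (2 * 3)
for row in K:
    print(row)
```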

Fisher approximation

[Figure: Fisher information matrix (FIM) of AlexNet, labeled 60,000,000; diagonal block of the last AlexNet layer, labeled 4,096,000; Kronecker factorization, labeled 1,000.]

Source: Kazuki Osawa. Introducing k-fac: A second-order optimization method for large-scale deep
learning, 2018.
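
A back-of-the-envelope count shows why the factorization matters. The sketch assumes that the figure labels are matrix side lengths, that the last AlexNet layer maps 4,096 inputs to 1,000 outputs (so 4,096 × 1,000 = 4,096,000 weights, biases ignored), and that entries are stored as 4-byte floats; the helper `gib` is hypothetical.

```python
n_params_total = 60_000_000           # side length of the full FIM (from the figure)
d_in, d_out = 4_096, 1_000            # assumed shape of the last layer
n_params_last = d_in * d_out          # 4,096,000 weights

full_fim_entries = n_params_total ** 2
last_block_entries = n_params_last ** 2
kfac_entries = d_in ** 2 + d_out ** 2  # two Kronecker factors for that block


def gib(entries, bytes_per_entry=4):
    """Storage in GiB, assuming float32 entries."""
    return entries * bytes_per_entry / 2 ** 30


print("full FIM        :", full_fim_entries, "entries,", gib(full_fim_entries), "GiB")
print("last-layer block:", last_block_entries, "entries,", gib(last_block_entries), "GiB")
print("K-FAC factors   :", kfac_entries, "entries,", gib(kfac_entries), "GiB")
```

Under these assumptions the two Kronecker factors need about 17.8 million entries, against roughly 1.7e13 for the dense last-layer block.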

Fisher approximation

[Figure: schematic FIM structures with panels labeled Block-diagonal FIM (4 blocks), Layer-wise FIM, Block-tri-diagonal FIM (10 blocks) and K-FAC (for each diagonal block).]

Source: Osawa, K. et al. Understanding Approximate Fisher Information for Fast Convergence of
Natural Gradient Descent in Wide Neural Networks, 2020.

Kronecker Inversion

The Kronecker product has a very interesting and critical property:

$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$

This means that the inverse of the product is the same as the product of the
inverses of the operands. This gives us a critical performance speed-up
because we just need to invert the small factor matrices.
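
The property is easy to check numerically on small factors. The sketch below is self-contained plain Python; the helpers `kron`, `matmul` and `inv2` and the two example matrices are assumptions for illustration. It multiplies $A \otimes B$ by $A^{-1} \otimes B^{-1}$ and should recover the identity up to rounding.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    m, n, a, b = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i][j] * B[k][l] for j in range(n) for l in range(b)]
            for i in range(m) for k in range(a)]


def matmul(A, B):
    """Plain matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inv2(M):
    """Closed-form inverse of an invertible 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[2.0, 1.0], [1.0, 3.0]]
B = [[4.0, 0.0], [1.0, 2.0]]

# Only the two small factors are inverted; the 4 x 4 matrix never is.
product = matmul(kron(A, B), kron(inv2(A), inv2(B)))
for row in product:
    print([round(x, 12) for x in row])
```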

BackPACK in PyTorch

If you want to play with K-FAC on PyTorch, you can try using BackPACK:

from torch import nn

from backpack import backpack, extend
from backpack.extensions import KFAC
from backpack.utils.examples import load_one_batch_mnist
from backpack.utils import kroneckers

X, y = load_one_batch_mnist(batch_size=512)

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 10),
)
lossfunc = nn.CrossEntropyLoss()

# Extend the model and the loss so BackPACK can hook into backprop.
model = extend(model)
lossfunc = extend(lossfunc)

loss = lossfunc(model(X), y)

# The backward pass inside the context also computes the K-FAC factors.
with backpack(KFAC(mc_samples=1)):
    loss.backward()

named_params = dict(model.named_parameters())
layer_weights = named_params["1.weight"]
# layer_weights.grad = [10, 784]

kfac_f1, kfac_f2 = layer_weights.kfac
# kfac_f1 = [10, 10]
# kfac_f2 = [784, 784]

# Materialize the Kronecker product of the two factors.
mat = kroneckers.two_kfacs_to_mat(kfac_f1, kfac_f2)
# mat = [7840, 7840]

Dangel, Kunstner, and Hennig, “BackPACK: Packing more into backprop”, 2019

Kronecker Matrices

[Figure: visualization of the Kronecker matrices.]

$G(\theta) \in \mathbb{R}^{7840 \times 7840}$

Note that the colormap of $G(\theta)$ was changed for visualization
purposes.

Some empirical results

[Figure: curves on SVHN (train + extra) against the number of epochs; legend: K-FAC with batch sizes (bz) 256, 1024 and 4095.]

Source: Johnson, M. et al. K-FAC and Natural Gradients, 2017.
https://supercomputersfordl2017.github.io/Presentations/K-FAC.pdf.

Section V

Thoughts

Benchmarking optimizers

[Figure: three panels, each with the learning-rate schedule on the x-axis (constant, cosine, cosine with warm restarts, trapezoidal).]

Source: Schmidt, R. M., Schneider, F., Hennig, P. (2020). Descending through a Crowded Valley -
Benchmarking Deep Learning Optimizers.

Lines in gray (smoothed by cubic splines for visual guidance only) show the relative improvement
for a certain tuning and schedule (compared to the one-shot tuning without schedule) for all 14
optimizers on all eight test problems. The median over all lines is plotted in orange, with the
shaded area indicating the area between the 25th and 75th percentile.

Schmidt, Schneider, and Hennig, “Descending through a Crowded Valley - Benchmarking Deep
Learning Optimizers”, 2020

To think

► Do we really need normalization techniques (e.g. Batch
Normalization) if we can come up with optimization methods that
embed invariance properties?

► What are the other difficult problems we can optimize with better
optimization algorithms?

► What other approximations can we achieve?

► What is the empirical Fisher actually doing?

► What are the barriers to the use of second-order or approximately
second-order methods? Are we going to see more software support?

► Are we driving towards more hyper-parameters or more robust
methods?

► What are the properties of the solutions that different
optimization methods reach?

Q&A